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Abstract. Due to the catastrophic consequences of tsunamis, 
early warnings need to be issued quickly in order to miti- 
gate the hazard. Additionally, there is a need to represent 
the uncertainty in the predictions of tsunami characteristics 
corresponding to the uncertain trigger features (e.g. either 
position, shape and speed of a landslide, or sea floor defor- 
mation associated with an earthquake). Unfortunately, com- 
puter models are expensive to run. This leads to significant 
delays in predictions and makes the uncertainty quantifica- 
tion impractical. Statistical emulators run almost instanta- 
neously and may represent well the outputs of the computer 
model. In this paper, we use the Outer Product Emulator 
to build a fast statistical surrogate of a landslide-generated 
tsunami computer model. This Bayesian framework enables 
us to build the emulator by combining prior knowledge of 
the computer model properties with a few carefully chosen 
model evaluations. The good performance of the emulator is 
validated using the Leave-One-Out method. 



1 Introduction 



A tsunami is a series of powerful water waves generated 
by earthquakes, volcanic eruptions, underwater landslides as 
well as local landslides along the coast. Their main charac- 
teristic is the high speed of propagation. As emphasized by 
the recent tragic events in March 2011 in Japan and in De- 
cember 2004 in Indonesia, tsunamis may be extremely catas- 
trophic: they are able to destroy buildings, roads and gen- 
erally the infrastructure is seriously affected. But the most 
tragic part is that tsunamis can lead to the loss of human lives. 
A deep knowledge of tsunamis is required in order to predict 



the maximum runups and rundowns, and also to give early 
warning notices to the regions that may be affected. 

Since the most common sources for tsunamis are earth- 
quakes, earthquake-generated tsunamis have been exten- 
sively investigated. Landslide-generated tsunamis have been 
much less studied and the existing knowledge about them 
is more limited. They are characterised by relatively short 
periods, compared to the earthquake-generated ones, result- 
ing to stronger viscous damping. Hence, they do not travel 
as long distances as the earthquake-generated tsunamis do. 
Therefore, one of their characteristics is that their whole life 
cycle takes place near the source. Nevertheless, they can 
reach high amplitudes and can also become extremely harm- 
ful ( |Synolakis efaH|2002{[Tinti et al.||2"0"08] ). The more chal- 
lenging part in landslide-generated tsunami modelling results 
from the fact that they are not instantaneously generated, as 
the earthquake-generated tsunamis are, and their generation 
depends strongly on how the shape of the sea floor changes 
with time (Bard et et aL||2003| ). 



Wiegel ( 1955 ) performed the first experiments for landslide- 
generated tsunamis, where a sliding mass was moved down 
an incline. More recently, it was observed by Liu et al. 



( |2005[ ) that larger wave maximum elevations occur for sub- 
aerial compared to submerged slides. Also, Panizzo et al. 
(2005) showed that the maximum wave amplitude depends 



on both the duration of the underwater motion and the front 
shape of the landslide. Studies about tsunamis generated by a 
sliding mass on a plane beach have also performed by |Lynett| 
and Liu (2005|l. The authors have investigated the whole life 



cycle of the tsunami: initially there is a high amplitude near 
the source, then the wave motion is predominantly near the 
shore, followed by edge waves along the shoreline and no 
motion near the source. 
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Sammarco and Renzi (2008) made an important contribu- 
tion by developing an analytical three-dimensional model for 
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landslide-generated tsunamis based on the forced linear long- 
wave equation of motion, considering a plane beach with a 
constant slope. The inputs of the model are the initial posi- 
tion, speed and spread ratio of the landslide and the output 
is the sea free-surface elevation at specific times and loca- 
tions. Additionally, by comparing to available experimen- 
tal data, they showed that the model represents the overall 
behaviour of the wave with acceptable accuracy. However, 
the predicted water elevations appear to be over-estimated, 
which was attributed to neglecting energy dissipation and 



dispersive effects. Renzi and Sammarco (2012 1 extended the 



landslide-generated tsunami model of Sammarco and Renzi 



(2008 ) to consider arbitrary initial position, speed and spread 
ratio. Furthermore, landslides in their framework can have 
a shape other than Gaussian. They investigated how these 
physical parameters and the shape of the landslide affect the 
resulting wave elevation. Renzi and Sammarco (20121 also 
analyzed the effect of the continental platform on the wave 
elevation. 

This paper presents a proof-of-concept case study for the sta- 
tistical analysis of a landlide-generated tsunami model, by 



employing the analytical model constructed by Sammarco 



and Renzi (2008). The main strategy of the analysis is to 



build a statistical emulator that accurately represents the an- 
alytical model, which can be used for fast predictions, quan- 
tification of uncertainties and sensitivity analysis. In Sec- 
tion 2, a more detailed explanation of the statistical emulator 
is presented. Section 3 describes the concept of a special 
form of emulator, named the Outer Product Emulator. An 
analytic description for the appropriate parameter selections 
and calculations required to build it are also presented. Sec- 
tion 4 describes the concept of the experimental design and 
its implementation. Section 5 shows the application of the 
Outer Product Emulator and its validation for the Sammarco 
|and RenzT| ( |2008| l analytical model. The resulting emulator is 
then used for extremely efficient sensitivity and uncertainty 
analyses in Section 6. 



2 Statistical emulator 



An emulator is a simple statistical model that approximates 
a simulator, where a simulator is a deterministic input-output 
computer model (analytical model, complex statistical -e.g. 
stochastic- model, or most commonly a numerical solver of a 
large system of equations such as PDEs). Given some inputs 
x, the simulator output is given by y = /(x) and the emulator 
is denoted by /(x), which indicates that it is an approxima- 
tion of the simulator. In most cases, running simulators is 
very time and resource consuming, so one can only afford 
a limited number of runs. The use of emulators comes as a 
solution to this problem, since emulators run almost instan- 
taneously. However, due to the fact that they are approxi- 



mations of the computer model, some error is introduced by 
using them. So, emulators are recommended to be used only 
in the case when the simulator is expensive to evaluate. The 
error amount can be estimated since they can make prob- 
abilistic predictions of the output that the simulator would 
produce if it was exercised over certain regions of the input 
space. Therefore, the main use of statistical emulators is for 
fast predictions of the simulator output. 

Analyses such as uncertainty and sensitivity analyses, as well 
as calibration, require a large number of evaluations of the 
expensive simulator and this means that they can become im- 
practical. An emulator can be built and used to make such 
demanding analyses more efficiently. The uncertainty anal- 
ysis provides us with a knowledge of the distribution of the 
simulator output. The sensitivity analysis investigates how 
each of the inputs affect the output. Calibration consists of 
fitting a model to the available observations by adjusting its 
parameters (we are not considering calibration in this paper). 

The emulator is created by employing a number of simulator 
evaluations. The error in its predictions is inversely related 
to the number of simulator evaluations. Therefore, a signifi- 
cantly large number of evaluations can make this error negli- 
gible, but this is unusual due to the simulator computational 
complexity. Also, since the emulator represents a determin- 
istic model, it is also a deterministic model where the simu- 
lator has been exercised: it predicts perfectly, with zero error, 
the output at points that have been used in the creation of the 
emulator. At new points, the emulator gives a distribution 
for /(x) with mean value /(x) and standard deviation which 
represents the error in the prediction and hence how close it 
is likely to be to the true simulator output /(x). 

Bayesian statistical analysis, through the emulators, can be 
much more efficient than other methods to quantify uncer- 
tainties, e.g. the standard Monte Carlo method for which 
the simulator must be run repeatedly. In a Bayesian analysis 
we first build a representative emulator for the simulator and 



then use it for further analysis. Oakley and O'Hagan (2002 



2004) and O'Hagan (2006l focused on a Bayesian approach 
for uncertainty and sensitivity analysis. They concluded that 
a Bayesian approach is more efficient than the Monte Carlo 
method as it uses a significantly smaller number of model 
runs. One can take advantage of this by running the model at 
higher resolution. 

The form of the emulator used in this analysis is the Gaussian 
Process (GP). A GP is an extension of the familiar and pop- 
ular Normal distribution, also called Gaussian. Nice math- 
ematical properties of the Normal distribution carry over to 
the GP and therefore the GP is the principal tool for creating 
an emulator, together with prior knowledge about the simu- 
lator. It is worthy to say that the term "prior knowledge" is 
used to indicate the initial beliefs about the simulator before 
the use of the available data. An unknown function /(.) has 
a GP distribution if for any set of input points {xi,...,x n }, 
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the set of outputs {f(xi),...,f(x n )} follows a multivariate 
Normal distribution. The simulator is represented by a GP 
distribution with mean function mo(.) and co variance func- 
tion Vo(-,.)> i- e - 



/(.)|/3,a 2 ,B~GP(mo(.),^(.,)) 



(1) 



where the symbol ~ stands for "is distributed as". The mean 
function is described by 



mo(x) = h(x) /3, 



(2) 



in which h(.) is the set of regression functions and f3 is the 
vector of the unknown coefficients. The functions h(.) are 
chosen to represent the main form of the actual simulator 
/(.). The covariance function, which generates some addi- 
tional variations as well as uncertainty, is given by 



V (x,x')=cr 2 C(x,x';B) 



(3) 



where C(.,.;B) is a correlation function whose shape is 
known but with unknown correlation parameters B, also 
called hyperparameters. A common choice for C(.,.;B) is 



C(x,x';B) =exp{-(x- x') 1 B(x-x')} 



(4) 



where B is a diagonal matrix of the so-called smoothing pa- 
rameters b u . The inverse square roots of these parameters, 
1/VDji, are known as the correlation length scales. The bu 
(or the correlation length scales) describe how rapidly the 
output responds to changes in each input; the correlation 
lengths scales give an indication of the distance in the input 
space for which correlation between the simulator outputs is 
either significant or negligible. 



3 Outer Product Emulator 

In the case where the simulator has multiple outputs, the cre- 
ation of a surrogate model is more complicated. The simplest 
approach is to build separate independent emulators for each 
output. However this method has a major drawback: it ig- 
nores the correlations between the outputs. |Rougier| ( |2008| l 
proposed an approximate multivariate emulator, named the 
Outer Product Emulator (OPE), that creates one emulator for 
all the outputs, simplifying the process by using separable 
functions in inputs and outputs. 

Therefore, the main advantage of the OPE is that the building 
cost is significantly smaller compared to a general multivari- 
ate emulator. The construction and use of an OPE can be 
fast, even in the case where many simulator evaluations and 
a large number of outputs exist. This property of the OPE is 
very important for the case investigated in this work. Indeed, 
the wave shape is not oscillating periodically and hence the 
frequency of the oscillation is not constant, so a large num- 
ber of simulator evaluations is necessary. We have to run 



the simulator at small time steps and hence a large number 
of evaluations are collected to describe the outputs. This is 
the primary reason why we decided to use the OPE for the 
analysis. 



Rougier et al. ( 2009| l describe further this special form of 
statistical emulation. The OPE has the form: 



fi(r) = Y[Pj9j(r,Si) + e(r,8i) 



3=1 



(5) 



where fi(r) is the i th simulator output at input r, gj is the 
set of regressors, /3j are the unknown coefficients and e is the 
residual. Additionally, Si represents the output domain - e.g. 
time, space - corresponding to the i th simulator run. 

In order to build an emulator, appropriate distributions for 
(3 and e must be chosen. A convenient choice is the Normal 
Inverse Gamma distribution that enables the use of conjugacy 
(so posterior estimates can be computed explicitly without 
resorting to Markov Chain Monte Carlo as in more standard 
fully Bayesian emulators), described by 



0\T,B~N(m,TV) 

£ |t,B~GP(0,t Ka (.)) 

T\B~IG(a,d) 



(6) 

(7) 
(8) 



where B = {m,V,a,d,K\(.)} is the set of the hyperparame- 
ters and K\(.) is the covariance function of the residuals with 
correlation lengths A. Also, N and IG denote the Normal 
and Inverse Gamma distribution, respectively. Summing up, 



{0,e}~NIG(m,V,a,d) 



(9) 



where the hyperparameters a and d denote the degrees of 
freedom and the scale respectively. 

Furthermore, a choice for the appropriate regression £,•(.) 
and covariance functions of the residual K\(.) is needed. 
There are two main characteristics that distinguish the OPE 
from a standard multivariate emulator. The first is that the co- 
variance function of the residuals is separated in inputs r and 
outputs s. This property can be represented by the equation 



K\(r,s,r',s') = Kl(r,r') x k s x (s,s') 



(10) 



The second characteristic is that the set of the regressor 
functions, G, is the outer product of the set of regres- 
sors for inputs, G r ~{g 7 ,(r)} l ' r =1 , with the set of regres- 
sors for outputs, G s = {g1 B (s)} l ' s =1 , where the expression 

a=/3 indicates that the term a is equal by definition to the 
term j3. Therefore, the functions gj are given by gj (r, s) — 
9j r ( r ) ®#f s ( s )' where <Ei is the outer product symbol and 
j = {!,... ,v}, where v = v r x v s . 
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3.1 Maximizing the marginal likelihood 



3.2 Hyperparameters selection 



In order to find the most accurate representation of the sim- 
ulator, appropriate values for the correlation lengths and 
other unknown parameters can be estimated by maximis- 
ing the corresponding marginal likelihood (Rasmussen and 
[Williams] [2006 ) before getting posterior ditributions of emu- 
lated simulator outputs. In the application described in Sec- 
tion 5, this technique is used to obtain representative values 
for the four correlation lengths, one for each of the three in- 
puts and one for the output. Starting from the general equa- 
tion of the emulator, that is 



y = f(x) = h(x)+e(x) 



g(x) T (3 + e(x) 
Q(x)/3+e(x), 



(11) 



where 

e~GP(0,r KA ), (12) 

0~N(O,tV), (13) 

we assume that the mean value of the unknown coefficients 
is zero and also that V can be defined as V = a 2 1, with the 
common multiplier parameter to be described by 

T~IG{a,d) (14) 

Therefore, this reformulation of the prior distributions en- 
tails that the regression functions multiplied by the unknown 
coefficients /3, i.e. the function h(.), has a Normal prior dis- 
tribution given by 

h\B~N{0 lT QVQ T ) (15) 



The likelihood function is described as follows: 
y\h,B ~ N(h,TK X ) 



(16) 



The marginal likelihood ( can be obtained from the integral 
of the likelihood times the prior, i.e. 



P(V\B)= p(y\h,B)p(h\B)dh 



(17) 



Therefore the marginal likelihood has a Normal distribution 
described by 

y\B~N{0,TK X + TQVQ T ) (18) 

Consequently, the log marginal likelihood function is 

A = \og{p{y)) = - l -I T C- 1 f- l -\og\C\+ constant (19) 

where C ' = r (k\ + QVQ T ) . The derivative, with respect to 
the correlation lengths, of the log marginal likelihood is given 
by 

1 

2 ; ° d\~ J 2"' v ~ d\ 
In order to calculate C _1 , the Cholesky decomposition is 
used. Optimization methods are used to help us with the 
maximization of the marginal likelihood function in order to 
find correlation lengths. 



VA^C-^C-V-W 1 ^) (20) 



The final step in the process of building the prior emula- 
tor for the simulator is the selection of the hyperparame- 
ters {m,V,a,d}. To determine adequate hyperparameters, 



the simple approximation method presented by Rougier et al. 



(2009) is used. The idea is to average the simulator output 
fi(r) over the inputs r and output i, which means that /»(r) 
is replaced by f(x), and also to assume that x has a uniform 
distribution. Using the mean and variance of the simulator 
output, f(x), the hyperparameters are estimated. Complet- 
ing the selection of the hyperparameters yields the prior em- 
ulator. 

The prior emulator is combined with a sample of simulator's 
evaluations, called the training sample, giving the posterior 
emulator. The resulting emulator gives a prediction distribu- 
tion for each point in the evaluations' output domain. These 
predictions are Student-t distributed with parameters (mean, 
variance and degrees of freedom) that are calculated accord- 
ing to the procedure explained in Rougier (2008). 



After building the emulator, the next step is to test how accu- 
rately it represents the simulator. This process is called vali- 
dation, and it is recommended to be performed before mak- 
ing use of the emulator. We use the so-called "leave-one-out" 
diagnostic (LOO): one evaluation is left out and predicted 
using an emulator constructed from the rest of the training 
data set. We repeat this for all the evaluations. Therefore, 
the ability of the emulator to represent the simulator can be 
quantified. 



4 Experimental Design 

One of the most important steps in the analysis is the exper- 
imental design. This is the process of finding a space filling 
design that covers the input space sufficiently. Due to the fact 
that the input points are selected strategically, the amount of 
useful information passed to the emulator can be maximized. 
Hence, the required number of simulator runs for an accu- 
rate emulator can be reduced, resulting in a more efficient 
procedure. 

Many different experimental designs exist. The simplest one 
is the regular grid, where equally spaced points are selected 
for each parameter. However, even with the simplicity of 
this design, some drawbacks exist by using it. The most 
important one is its "collapsing" property, where multiple 
grid points have the same coordinate value when projected 
onto a parameter axis. This means that a limited informa- 
tion is obtained from these points. For example, for a three- 
dimensional input space, in order to obtain n distinct evalu- 
ations for each of the three parameters, the total number of 
required simulator runs is n 3 , which is highly inefficient. 
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The Latin Hypercube design (LHD) is an experimental de- 
sign that is constructed to avoid the "collapsing" property of 
grids. The LH design selects n different sample points from 
each of the k variables Xi,...,Xk using the following pro- 
cess. First of all, the range of each variable is divided into n 
equal probability and non-overlapping intervals. Then, one 
value from each interval is selected randomly with respect to 
the probability density of the interval. The n values obtained 
for X\ are paired randomly with the n values for X-2- These 
n resulting pairs are then combined randomly with the n val- 
ues for X 3 resulting into n triplets. The same process contin- 
ues until n k— tuplets are formed, which is the LH sample. 

However, only a subset of LH designs are space filling. To 
ensure a space filling input selection, we adopt the so-called 
"maximin" Latin Hypercube Design. The specific design 
follows the same process as the LHD to choose the sample 
points, although it has an additional constraint that is to max- 
imise the minimum distance between the points. Therefore, 
a maximum coverage of the input space is achieved. 

Urban and Fricker ( 20 IP} made a comparison of the Latin 
Hypercube with the regular grid design for the multivariate 
emulation. They report that the emulators built using the 
LHD make significantly improved predictions relative to the 
emulators created using a regular grid training sample. Fur- 
thermore, they concluded that the LH emulators are more ac- 
curate compared to the regular grid emulators in sensitivity 
analysis of a single-parameter model. 



5 Application to the SR tsunami model 

5.1 Model description 

In this section the methods described above are applied to 
find an accurate statistical representation of the landslide- 
generated tsunami analytical model of |Sammarco and Renzi| 
(2008 ), abbreviated as the SR model. This model takes as in- 
puts the initial position x , the speed u and also the spread 
ratio or shape c of the landslide, where the "spread ratio" 
is defined as the ratio of the landslide's characteristic length 
over the characteristic width. FigurefTlillustrates this specific 
analytical model set up. 

All the coordinates, functions and parameters used in the 
model are non-dimensional: 



x 
a 




(21) 



u = 



'o- 



X 



denotes the landslide maximum vertical thickness, £ is the 
the non-dimensional sea free-surface elevation, A is the land- 
slide characteristic width, t is the time and g is the accelera- 
tion due to gravity. 

When the landslide starts moving from the origin, which is 
the position where the sea surface meets the sloping beach, 
Xq is equal to zero. Also, negative values of Xq indicate that 
the landslide initiates from a subaerial position, whereas pos- 
itive values of xq indicate submerged slides. The output of 
this model is the sea free-surface elevation of the wave at 
given time and location. A plane beach with constant slope is 
considered and it is important to notice that the landslide con- 
tinues to move even after it falls into the water. This causes 
the existence of high wave elevations even at large times. 




Fig. 1: Sketch illustrating the landslide's motion as consid- 
ered in Sammarco and Renzi's analytical model. The y'-axis 
represents the shoreline, while the x'-axis is perpendicular to 
it. 



By considering this model, Sammarco and Renzi ( 2008| > 
came to the conclusion that the landslide generates a wave 
field that is composed by two components, oscillatory and 
evanescent. The life cycle of the wave can be visualized in 
Fig. [2l where the sea free-surface elevation of the landslide- 
generated tsunami wave is shown in polar coordinates at 
times t = 0.5,1,1.5,2,2.5,3,5,10,20. The initial position of 
the landslide is at the origin, the speed is equal to 1 and the 
spread ratio of the landslide is equal to 2, which means that 
the characteristic length is twice the size of the characteristic 
width. 

When the landslide occurs, it displaces water forward and 
an elevation wave is generated, that propagates mostly in 
the offshore direction. Also a depression wave occurs near 
the origin (see Fig. 2a i. Later on, the elevation wave 



spreads along the shoreline, while the depression wave ex- 



tends around the origin (see Figs 2b 2c 2d I, At larger times, 
a second elevation wave is generated at the origin and the de- 
pression wave spreads out (see Figs [2f| 2g I. Finally, at even 



where the primes denote dimensional values, a is the land- 
slide characteristic horizontal length, s is the beach slope, r/ 



larger times, the wave motion is dominated by edge waves 
propagating along the shoreline, with no motion around the 
origin (see Fig. 2h 2j). From this study, it is concluded that 
the first generated waves are not those with the larger ampli- 
tude. This indicates that in order to capture the maximum 
elevation, the model has to be evaluated up to a significantly 
large time t. 
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Fig. 2: Sea free-surface elevation of the landslide generated tsunami observed at different times with non-dimensional inputs 
(xq,uo,c) = (0,1,2). The horizontal axis represents the shoreline and the vertical axis points to the offshore direction. 



5.2 Training sample 



In this work, a statistical emulator is constructed looking 
at specific locations; meaning that its output is only time- 
depended. Specifically, seven locations along the shoreline 
(x = 0) at y = 2,4,6,7,8,8.38 and 10 have been investigated. 
The time domain is selected to be between and 35. Small 
time steps are required in order to have sufficient information 
to capture the wave shape with sufficient detail: specifically 
dt = 0.2 was chosen for the analysis. 

The first step of the analysis is the experimental design. Us- 
ing the "maximin" Latin Hypercube design method, as de- 
tailed in Section 4, forty points, (xq,uo,c), are chosen to 
cover the three-dimensional parameter space. This is a com- 
promise in order to have a significantly good coverage of 
the design space as well as a significantly small computa- 
tion cost. The input domain is chosen to be x e [—3,1], 
u € [1,2] and cG [0.5,3], 



The positions of the forty inputs in the parameter space are 
shown in Fig. [3] The colour at each point indicates the 
maximum sea free-surface elevation, for the location x = 
and y — 8.38, i.e. along the shoreline and far away from 
the source. The figure shows that the maximum wave el- 
evation significantly depends on the landslide's speed: the 
larger the speed uo, the larger the maximum elevation. Fur- 
thermore, it can be observed that the maximum wave eleva- 
tion shows higher amplitudes when the landslide starts from a 
subaerial close to the origin position and also when the land- 
slide spread ratio is less than 2. However, the dependence 
of the maximum elevation on the initial position and spread 
ratio of the landslide is not as obvious as that on the speed. 

For example consider points 13 and 25. They both represent 
landslides characterised by high speed and spread ratio close 
to one. However point 13 is a subaerial case while point 25 is 
a submerged one. This yields a significant difference in the 
maximum sea free-surface elevation, with the subaerial case 
being much higher. 
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Fig. 3: Maximum sea free-surface elevation at the loca- 
tion (x,y) — (0,8.38) for time between and 35 for each 
of the 40 design input points selected using the "maximin" 
LHD method. Three quantities are varied: the landslide's 
speed, its initial location and its shape, that are given in non- 
dimensional form as in Eq. (21). 



The simulator's evaluations for the other six locations along 
the shore yields similar conclusions about the dependency of 
the maximum sea free-surface elevation to the input parame- 
ters. 



5.3 OPE prior choices 



The next step in the analysis involves the appropriate prior 
choices for the regression and residuals covariance functions 
for inputs r and outputs s. In the case of the SR model, r is 
equal to (xo,uq,c) and s is time t. The set of input regression 

functions, G r ={g\,...,g1 }, where v r is the number of input 
regressors, consists of a combination of appropriate choices 
of polynomials for each of the three input parameters. For 
each input parameter, a linear and a quadratic polynomial, 
plus a constant term, are chosen, resulting to a total of seven 
input regressors. Since the simulator's output variation with 
respect to r is smooth, the use of higher order polynomials is 
unnecessary, which would additionally increase the prior un- 
certainty. The chosen polynomials are shifted into the unit in- 
terval [0,1] and their coefficients are selected so that the two 
functions for each input parameter are orthonormal with a 
uniform weighting function. Combining all the inputs' func- 
tions, the set of chosen input regressors is the following: 



G r ={1,V3 



(2:o + 3) 



-3V5 



(2-0 + 3) 



-4\/5( 



£q + 3^ 



4 

V3(u Q - l),-3>/5(«o- l)+4>/5(«o - 1) 2 , 

c-0.5 



Vz 



(c-0.5) 
2.5 : 



-3\/5 



(c-0.5) 
2.5 



-4V5 



2.5 



) 2 } (22) 



After choosing the regression functions for the inputs, we 
need to make an appropriate choice for the regression func- 
tions for the output, G s = {g(,...,gf, }, where v s is the num- 
ber of output regressors. Fourier terms are chosen of the form 
sin(2|^) and cos(4^), where T is the period of the oscilla- 
tion, in addition to a constant term. However, since the sea 
free-surface elevation waves do not oscillate with constant 
period, this selection is challenging. To make this selection, 
we consider the range of oscillating frequencies present in 
the wave and using the LOO diagnostic method (explained 
in more detail in Section 5.4), we choose the smallest set of 
frequencies that give the most accurate predictions, since as 
for the case of input regressors an unnecessary large number 
of regressors is not desirable. The selected set of frequencies 
is the following: { g , g , j , 5, 5 }■ Therefore, the set of output 
regression functions is given by 

G s = {l,sin(7rf/3),cos(7rt/3),sin(27rt/5),cos(27rt/5), 

sin(7rf/2) ,cos(7rf/2) ,sin(2rri/3) ,cos(27rf/3) , 
sin(7rf),cos(7rf)} (23) 
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Power exponential functions are chosen for input and output 
residuals covariance functions, n r and n s : 



k t = exp(-( |:ro X " l ) 3/2 ) x exp(-( K "Q 1 ) 372 ) 
xexp(-(^l) 3 / 2 ) 



and 






(24) 



(25) 



respectively, where Aa;, A u , A c represent the correlation 
lengths for inputs and X t denotes the output (i.e. time) cor- 
relation length. The values of the correlation lengths can be 
varied in order to adjust the fit of the emulator. The corre- 
lation lengths are chosen by maximizing the marginal likeli- 
hood. Since r appears in the equation of the marginal likeli- 
hood ( fl9) , in order the process of maximizing the marginal 
likelihood to be feasible, r has been treated as a constant 
and estimated by the process simultaneously with the corre- 
lation lengths. The estimated value for r is not used further 
in the analysis since r was considered as constant only for 
practical purposes for this process and it is everywhere else 
considered as a scalar variable that is described by an Inverse 
Gamma distribution. Furthermore, note that the 3/2 expo- 
nent is chosen so that the covariance is smooth enough, but 
not too much as the usual choice of square power is infinitely 
smooth and hence may not be realistic for such a complex 
simulator. 

The last step for the creation of the prior emulator for the 
SR model is to make a choice for the values of the hyper- 
parameters {m,V,a,d}. To do so we follow the method de- 
scribed by Rougier et al. (2009). We have already assumed 
m = 0. The hyperparameter a, which is equal to the number 
of degrees of freedom, takes the value 3 in the case of the 
SR m odel. Also, after th e simple calculations recommended 
by Rougie retakl ( |2009| , it is concluded that a 2 = 0.257 and 
d = 0.208. Hence, V can be easily obtained from V = a 2 1. 
By fixing these parameters, the creation of the prior emulator 
is completed. Using the evaluations of the 40 selected design 
points, the prior emulator is updated to obtain the posterior, 
which is the statistical emulator. Evaluating the statistical 
emulator at a given input point, (xo,Uo,c), results in predic- 
tions of the output's distribution for all the points in the time 
domain, in this case from to 35, every 0.2 time step, i.e. 
176 prediction distributions. 

5.4 Emulator's validation 

After the creation of the emulator, the LOO validation 
method is applied, resulting in 40 LOO diagnostic plots. 
These diagnostics give information about the predictive 
power, capabilities and shortcomings of the emulator, since 
we can estimate the amount of the error induced by using 



the emulator instead of the simulator. Some of the diagnos- 
tic plots for the location (x,y) — (0,8.38) are shown in Fig. 
[4] Similar diagnostic plots are created for all the other lo- 
cations investigated. In general, the LOO diagnostics allow 
us to conclude that in most of the cases the emulator predicts 
very well the simulator evaluations, capturing both shape and 
the maximum wave elevations (peaks). Additionally almost 
always the simulator's evaluation line is within the 95% pre- 
diction credible interval (ideally it should be within this in- 
terval 95% of the time). 

However, on some of the diagnostic plots, the prediction is 
not very accurate. One of the fundamental reasons affecting 
the emulator performance is the position of the point at which 
we try to predict in the input space. Generally, it is expected 
to obtain more accurate predictions in the cases where the 
points at which we try to predict are surrounded closely by 
other design points, compared to the cases where the points 
are located in a sparsely covered region, since more infor- 
mation can be obtained by the neighbouring points. The be- 
haviour at each point is significantly linked to the behaviour 
at the points close to it and this influence decays rapidly with 
the distance separating the two points. To quantify this, the 
Euclidean distances in the three-dimensional input space be- 
tween a point and the other 39 points are obtained. Then 
the mean values of these distances (MED) for each of the 40 
input points are calculated: 



MED: 



J2i=i VA^i - x z) 2 + ( u i - u 2) 2 + (ci - c 2 ) 5 



39 



(26) 



FigureBJdisplays the mean Euclidean distances for all the de- 
sign input points. We can see that the points 8, 10, 12 and 25 
show a large MED from the rest of the 39 points. Looking 
at the LOO diagnostics of these four points in Figs |4aj [4b] 
4c 4f] we can easily observe that the predictions are not very 
accurate. However, the maximum wave elevation, which is 
the most important measurement, is still satisfactory and al- 
most everywhere the simulator evaluation lines are within the 
95% credible intervals. This indicates that, even for the de- 
sign points that are isolated from the neighbouring points, the 
emulator predictions are still usable. 

On the other hand, points such as 19, 24, 27 and 36 are af- 
fected significantly by the other points, separated by small 
Euclidean distances from the rest of the 39 points in space. 
Looking at the diagnostic plots of these points (Figs Hd] He] 



4g 4h i, it is obvious that the emulator does an excellent job 



in prediction, since all the features of the wave are predicted 
accurately by the emulator. 

Two measures that can be used to quantify the emulator's 
accuracy are the mean credible interval length (MCIL) and 
the root mean square error (RMSE) between the observed 
and the predicted evaluations at each of the 40 input points. 
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(a) (x ,U ,c) = (-2.88,1.06,2.82) -point number 8 
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(c) (xo,uq,c) — (0.70,1.36,0.63) -point number 12 



(d) (xo,Uo,c) = (—0.45,1.77, 1.76) -point number 19 




(x0,u0,c)=(-077, 1.84, 1.49) 
mean posterior value 
95% CI 




(X0,u0,c)-(0.98. 1.94, 1.06) 
mean ppsterior value 
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(e) (xo,Uo,c) = (-0.77, 1.84, 1.49) -point number 24 



(f)(xo,«o,c) = (0.98, 1.94, 1.06) -point number 25 



Fig. 4: Diagnostic plots for some of the input points looking at (x,y) = (0,8.38). Blue line is the simulator's evaluation, red is 
the mean value of the posterior distribution and dotted grey is the 95% credible interval of the posterior distribution. 
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(h) (xo,uo,c) = (—0.86,1.67,2.15) -point number 36 

Fig. 4: Diagnostic plots for some of the input points looking 
at (x,y) — (0,8.38). Blue line is the simulator's evaluation, 
red is the mean value of the posterior distribution and dotted 
grey is the 95% credible interval of the posterior distribution. 



The RMSE is given by the equation 



RMSE ■■ 



A-,i=l\ X i X i) 



(27) 



where Xi and Xi are the observed and predicted values at each 
time step i, respectively, and n is the number of time steps. 

Figures [6] and [7] display the MCIL and the RMSE versus 
MED, respectively, for all the input points, looking at the 
case of the location (x,y) = (0,8.38). We observe a positive 
correlation between the MED and both the MCIL and the 
RMSE. Therefore, this confirms that the distance separating 
the points in space is a fundamental factor that affects the 
predictive power of the emulator and hence this highlights 
the importance of a good experimental design. This positive 
correlation is also satisfied for the other locations examined. 
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Fig. 5: Euclidean distance between each of the points and the 
other 39 points in the three-dimensional parameter space. 



In Fig. [8] the RMSE with respect to MCIL is presented for all 
the 40 diagnostics for the seven locations along the shoreline 
investigated, in order to compare the emulator's performance 
when applied to different locations. A combination of both 
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Fig. 6: Mean Euclidean distance vs. mean 95% credible in- 
terval length for the location (x,y) = (0,8.38), where the dot- 
ted line is the linear regression. 
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Fig. 7: Mean Euclidean distance vs. RMSE for the location 
(x,y) = (0,8.38), where the dotted line is the linear regres- 
sion. 



small RMSE and MCIL is desirable, indicating both small 
error and small uncertainty in emulator's predictions. The 
figure clearly shows that the emulator performs similarly for 
all the locations investigated. Therefore, the emulator can be 
applied to different locations along the shoreline, resulting in 
accurate enough representations of the simulator output. The 
reasons that we have slightly better predictions at some loca- 
tions compared to others is an area of further investigation. 
Nevertheless, the location along the shoreline with y = 8.38 
shows the worst results in this Figure. Therefore, the predic- 
tions of the emulator for the other locations are better than 
the ones given in Fig. HI This reinforces the confidence we 
have in our emulator. 



6 Sensitivity and Uncertainty Analyses 

In Section 5 we have presented the process to create a sta- 
tistical emulator that can predict the simulator's output with 
sufficient accuracy, for a number of different locations along 
the shoreline. Therefore, the emulator can be used in place 
of the expensive-to-run simulator to efficiently perform anal- 
yses that require a large number of evaluations, in order to 
save time without sacrificing accuracy. In this Section, we 



Fig. 8: Root Mean Square Error vs. mean CI length. Differ- 
ent types and colors represent different locations along the 
shoreline. 



demonstrate a sensitivity and uncertainty analyses using the 
emulator. 



6.1 Sensitivity analysis 

The statistical emulator is used to carry out a sensitivity anal- 
ysis of the model, where we investigate how sensitive is the 
maximum wave elevation for t < 35 to changes in inputs. 
Additionally, we examine whether the individual locations 
along the shoreline present consistent sensitivity to inputs' 
variation. 

Fig. [9] displays the case for the location (x,y) — (0,8.38). 
In each of the three plots, the maximum elevation is plotted 
against the initial position xo, speed uo and spread ratio c of 
the landslide, respectively, with the other two input parame- 
ters being kept constant. To ensure maximum emulator's ac- 
curacy and keep RMSE to the minimum, the input domain in 
sensitivity analysis is chosen to be the subset of the whole do- 
main where the mean Euclidean distance between the points 
are small as presented in Fig. B] Specifically, we consider 
x Q e [-2,0], uo € [1,2] and ce [0.5,2.5]. 

From Fig. |9a]we can see an obvious relationship between the 
landslide's speed and the maximum elevation. Specifically, 
a landslide with a larger uo gives larger maximum sea free- 
surface elevations. No strong dependency of the maximum 
elevation on initial position and spread ratio can be observed. 



Figure 9b highlights the positive relationship between uq and 



the maximum elevation, with the larger the u$, the larger the 
maximum elevation. Finally, Fig. [9c] shows that a landslide 
initiating from a subaerial position shows larger maximum 
sea free-surface elevations compared to a landslide starting 
from the origin. So, a relationship between the xq value and 
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the maximum elevation is indicated. Also, a landslide mov- 
ing with a larger speed yields larger maximum elevations. 
Moreover, we cannot say that the spread ratio is a significant 
factor at the specific range investigated. The same conclu- 
sions result by repeating the sensitivity analysis for the other 
six locations. We could easily perform similar analyses in 
which the output is another important aspect of the tsunami, 
different from the maximum elevation. 

A comparison of how sensitive is the maximum wave eleva- 
tion at different locations to changes in the input parameters 
is showed in Fig. 10 11 and 12 Each of the figures illustrate 



the change in maximum sea free-surface elevation with re- 
spect to variations in one of the input parameters, keeping the 
other two constant. We look at four different combinations 
of the constant parameters. We conclude that the sensitivity 
of maximum elevation is very similar for all the investigated 
locations along the shoreline. 




(a) 



(b) 



(b) max. elevation w.r.t. speed for two different initial positions 
and spread ratios 
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(c) max. elevation w.r.t. spread ratio for two different initial 
positions and speeds 

Fig. 9: Maximum sea free-surface elevation with respect to 
(a) initial position, (b) speed and (c) shape, for the time in- 
terval [0,35] and position (x,y) = (0,8.38). 




(O 



(d) 



Fig. 10: Maximum sea free-surface elevation with respect to 
initial position for (a) (uq,c) — (1,0.5), (b) (uo,c) — (1,2.5), 
(c) (u ,c) = (2,0.5) and (d) (u ,c) = (2,2.5), for the time 
interval [0,35]. 

Overall, the conclusions reached by using the emulator are 
the same as those obtained using the simulator as shown in 
Fig. [3] However, the emulator has the fundamental advan- 
tage that it is much faster compared to the simulator. There- 
fore, it can be evaluated at a much larger number of inputs, 
leading to higher resolution and smoother plots. Figure [9] 
plots required a large number of emulator evaluations, specif- 
ically 2012. Importantly, the required emulator running time 
is very short. A total time for this entire analysis for a specific 
location was around 186.6 seconds on a Dual Core 3.06GHz 
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(a) 



(b) 
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(a) 
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Fig. 11: Maximum sea free-surface elevation with respect 
to landslide's speed for (a) (xq,c) — (—2,0.5), (b) (xq,c) = 
(-2,2.5), (c) (x ,c) = (0,0.5) and (d) (x ,c) = (0,2.5), for 
the time interval [0,35]. 



(c) 



(d) 



Fig. 12: Maximum sea free-surface elevation with respect 
to landslide's spread ratio for (a) (xq,uq) = (—2,1), (b) 
(x ,u Q ) = (-2,2), (c) (x ,uo) = (0,1) and (d) (x ,u ) = 
(0,2), for the time interval [0,35]. 



computer. Using a simulator to perform the same analysis 
would take much longer, as a single run to reconstruct the 
sea free-surface elevation time series up to time 35 with the 
SR analytical model takes about 30 minutes. 

6.2 Uncertainty Analysis 

Usually the largest amount of uncertainty induced in simula- 
tor evaluations comes from the high uncertainty of tsunami 
trigger features. It is impossible to know exactly the initial 
position, speed and spread ratio of the landslide that cause 
the tsunami. Since, as we have shown, the emulator can pro- 
vide accurate enough predictions of the simulator's outputs, 
an uncertainty analysis is performed by employing the emu- 
lator in the place of the simulator. The uncertainty analysis 
will give us the amount of uncertainty in the predictions that 
is due to the uncertain inputs, as well as from the use of em- 
ulator in place of the simulator. Usually experts have some 
knowledge about the most likely distribution of the inputs. 
Using these distributions, one can draw a number of random 
input samples, that can be given to the emulator in order to 
estimate the posterior distribution of key tsunamis features 
(e.g. maximum elevation). 

We assume that some collection of emergency management 
experts (in landslides or in real-time remote sensing) come to 
the conclusion that the inputs follow a beta distribution with 
some skewness and that the input domain is the same as with 



the sensitivity analysis. The beta distribution is a flexible 
distribution over a finite interval that can enable experts to 
express their believes. The distributions of input parameters 
are given by 

x -Be(5,2) for x € [-2,0] (28) 

u ~Be(2,5) for u €[l,2] (29) 

c-Be(2,5) for c€ [0.5,2.5] (30) 

More specifically, the initial position of the landslide follows 
a distribution that indicates that a starting position near the 
origin is more likely. Both the speed and spread ratio distri- 
butions are skewed to the left, in order to highlight landslide's 
speeds most likely close to one and characteristic length and 
width of the landslide to be most likely of similar dimen- 
sions. 

For this analysis we draw one thousand random samples for 



the inputs from the distributions given in (28i, (29 1, (30 1 



resulting in the prior input distributions shown as histograms 
in Fig. [13] 

We run the emulator using the selected inputs and therefore, 
we get one thousand predictions for the wave elevation at a 
fixed position along the shoreline for times up to 35 at 0.2 
intervals. From each of these time series, the maximum el- 
evation and the mean CI length are estimated, resulting in 
one thousand estimates for each one. The variation among 
the thousand values are quantified using quantiles. The same 
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Fig. 13: Histograms showing the prior knowledge about the 
distribution of input points. 



process is repeated for all the examined locations along the 
shoreline. The quantiles for the case of (x,y) = (0,8.38) 
are summarized in Table [T] The posterior distribution of the 
maximum elevation is plotted in Fig. [14] This information 
summarizes the expected tsunami wave elevation and the as- 
sociated uncertainty in prediction. 





1% 


5% 


50% 


95% 


99% 


maximum elevation 
mean CI length 


0.92 

0.28 


1.03 
0.40 


1.66 

0.66 


2.18 
0.90 


2.35 
1.03 



Table 1 : Maximum elevation and mean CI length percentiles 
for the position (x,y) = (0,8.38). 

Therefore, for a tsunami wave caused by the postulated 
landslide features, we are 95% confident that the resulting 
tsunami wave will have maximum elevation less than 2.18, 
and 99% confident that it will be less than 2.35, looking at 
a location along the shoreline and far away from the source 
(y = 8.38). The same analysis can be performed similarly 
for other locations along the shoreline. Again the ability of 
the emulator to make predictions almost immediately is high- 
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Fig. 14: Output distribution for maximum wave elevation at 
the location (x,y) = (0,8.38). 



lighted in this case, since the total running time was just 83.9 
seconds for 1000 runs at each of the locations compared to 
30 minutes on the same computer for a single run of the SR 
tsunami model. 



7 Conclusions 

A statistical emulator of the analytical landslide-generated 
tsunami model developed by Sammar co and Renzi| ( |2008| l 
has been obtained using the Outer Product Emulator. This 
surrogate model is built using a combination of prior knowl- 
edge about the simulator, appropriate choices of functions 
and parameters and a limited number of simulator evalua- 
tions. The simulator is computationally expensive to evalu- 
ate, while the emulator produces estimates almost instanta- 
neously. However, since the emulator is an approximation 
of the simulator an additional error is induced in predictions. 
But this amount of error can be estimated, since the predic- 
tions of the emulator are given as statistical distributions, not 
just values. Moreover, an accurate enough emulator repre- 
sents the actual model with an almost negligible error. 

The emulator can be used for sensitivity and uncertainty anal- 
ysis of the simulator, since these analyses are almost im- 
possible to perform using the simulator. We have demon- 
strated these two analyses and the potential for reducing sig- 
nificantly the computational time. Where the emulator re- 
quires 83.9 seconds to get a thousand evaluations, the simu- 
lator requires 30 minutes for a single evaluation. Therefore, 
in critical situations where early warnings are necessary, an 
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emulator can be a life saver by providing accurate prediction 
in a very short time. 

There are several possible avenues for extensions of this 
work. First, in this paper we only examined the wave mo- 
tion at specific positions in space. To describe the space-time 
variations of the tsunami wave using an emulator, one needs 
to choose an enhanced formulation that includes spatial cor- 
relations of the outputs. This is a logical step but requires sta- 
tistical expertise. Secondly, the source (landslide here) is still 
not realistic and prior expert knowledge could be included 
in a more factual way on a case study. Finally, more de- 
tailed simulations using more advanced physical-based mod- 
els with a complex bathymetry need to be carried out to pro- 
vide better quantifications of the subsequent sea free-surface 
elevations as well as more accurate run-ups on the shore with 
the help of a detailed orography. 
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